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Abstract. We focus on range query processing on large-scale, typi- 
cally distributed infrastructures, such as clouds of thousands of nodes of 
shared-datacenters, of p2p distributed overlays, etc. In such distributed 
environments, efficient range query processing is the key for managing 
the distributed data sets per se, and for monitoring the infrastructure's 
resources. We wish to develop an architecture that can support range 
queries in such large-scale decentralized environments and can scale in 
terms of the number of nodes as well as in terms of the data items stored. 
Of course, in the last few years there have been a number of solutions 
(mostly from researchers in the p2p domain) for designing such large- 
scale systems. However, these are inadequate for our purposes, since at 
the envisaged scales the classic logarithmic complexity (for point queries) 
is still too expensive while for range queries it is even more disappoint- 
ing. In this paper 4 we go one step further and achieve a sub-logarithmic 
complexity. We contribute the ART 5 structure, which outperforms the 
most popular decentralized structures, including Chord (and some of 
its successors), BATON (and its successor) and Skip-Graphs. We con- 
tribute theoretical analysis, backed up by detailed experimental results, 
showing that the communication cost of query and update operations is 
0(log^ log TV) hops, where the base b is a double-exponentially power of 
two and N is the total number of nodes. Moreover, ART is a fully dy- 
namic and fault-tolerant structure, which supports the join/leave node 
operations in O(loglogiV) expected w.h.p number of hops. Our experi- 
mental performance studies include a detailed performance comparison 
which showcases the improved performance, scalability, and robustness 
of ART. 
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4 A limited and preliminary version of this work has been presented as brief announce- 
ment in Twenty-Ninth Annual ACM SIGACT-SIGOPS Symposium on Principles of 
Distributed Computing, Zurich, Switzerland July 25-28, 2010 [28] 
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1 Introduction and Motivation 

Decentralized range query processing is a notoriously difficult problem to solve 
efficiently and scalably in decentralized network infrastructures. It has been stud- 
ied in the last years extensively, particularly in the realm of p2p, which is in- 
creasingly used for content delivery among users. There are many more real-life 
applications in which the problem also materializes. Consider the (popular nowa- 
days) cloud infrastructures for content delivery. Monitoring of thousand of nodes, 
where thousands of different applications from different organizations execute, 
is an apparent requirement. This monitoring process often requires support for 
range queries over this decentralized infrastructure: consider range queries that 
are issued in order to identify which of the cloud nodes are under- utilized, (i.e., 
utilization < threshold) in order to assign to them more data & tasks and 
better exploit all available resources, increasing the revenues of the cloud infras- 
tructure, or to identify overloaded nodes, (load > threshold) in order to avoid 
bottlenecks in the cloud, which hurts overall performance, and revenues. 

Each node in the cloud maintains a tuple with attributes: utilization, OS, 
load, Nodcld, e.t.c. Collectively, these makeup a relation, CloudNodes, and we 
wish to execute queries such as: 
SELECT Nodeld 
FROM Cloudnodes 
WHERE low < utilization < high 

or point and range queries, e.g. 
SELECT Nodeld 
FROM Cloudnodes 

WHERE low < utilization < high and OS=UNIX 

An acceptable solution for processing range queries in such large-scale de- 
centralized environments must scale in terms of the number of nodes as well as 
in terms of the number of data items stored. The available solutions for archi- 
tecting such large-scale systems are inadequate for our purposes, since at the 
envisaged scales (trilions of data items at millions of nodes) the classic logarith- 
mic complexity (for point queries) offered by these solutions is still too expensive. 
And for range queries, it is even more disappointing. Further, all available so- 
lutions incur large overheads with respect to other critical operations, such as 
join/leave of nodes, and insertion/deletion of items. Our aim with this work is 
to provide a solution that is comprehensive and outperforms related work with 
respect to all major operations, such as lookup, join/leave, insert/delete, and to 
the required routing state that must be maintained in order to support these op- 
erations. Specifically, we aim at achieving a sub-logarithmic complexity for all 
the above operations! 

Peer-to-peer (P2P) systems have become very popular, in both academia 
and industry. They are widely used for sharing resources like music files etc. 
Search for a given ID, is a crucial operation in P2P systems, and there has been 
considerable recent work in devising effective distributed search (a.k.a. lookup) 
techniques. The proposed structures include a ring as in Chord [15], a multiple 
dimensional grid as in CAN [22], a multiple list as in SkipGraph [2,10], or a 
tree as in PHT [24], BATON [13] and BATON* [14]. Most search structures 
(including all the ones just mentioned except for BATON* and PHT) bound 
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the search cost to a base 2 logarithm of the search space: for a system with 
N peer nodes, the search cost is bounded by O(logA). Relative to tree-based 
indexes, a disadvantage of PHTs (Prefix Hash Trees) is that their complexity 
is expressed in terms of the log of the domain size, D, rather than the size of 
the data set, A and depends on distribution over D — bit keys. BATON* is a 
multi-way search tree, which reduces the search cost to 0(log m A), where m is 
the tree fanout. The penalty paid is a larger update cost, but no worse than 
linear in m. One of the distributed indexes with high fanout is the P-Tree [5], 
where each peer maintains a B+-tree leaf and a path of virtual index nodes from 
the root to the specific leaf. Search is very effective, but updates are expensive, 
possibly requiring substantial synchronization effort. BATON* extends BATON 
by allowing a fanout of m > 2. Thus, the search cost becomes 0(log m A), as 
expected. Moreover, the cost of updating routing tables is 0{m log m A) only, 
as compared to (9(log 2 A) in BATON - an improvement that is better than 
linear in m. Furthermore, BATON* has better fault tolerance properties than 
BATON, and supports load balancing more efficiently. In fact, the system's fault 
tolerance, measured as the number of nodes that must fail before the network 
is partitioned, increases linearly with m. Similarly, the expected cost of load 
balancing decreases linearly with m. 

Our Results: In this paper we present the ART structure, which outper- 
forms the most popular decentralized structures, including Chord (and some of 
its successors), BATON and BATON* and Skip-Graphs. ART is an exponential- 
tree structure, which remains unchanged w.h.p., and organizes a number of fully- 
dynamic buckets of peers. We provide and analyze all relevant algorithms for 
accessing ART. We contribute theoretical analysis, backed up by detailed ex- 
perimental results, showing that the communication cost of query and update 
operations is O(log^logA) hops, where the base 6 is a double-exponentially 
power of two. Moreover, ART is a fully dynamic and fault-tolerant structure, 
which supports the join/leave node operations in O(loglogA) expected w.h.p 
number of hops. Since ART is a tree based system, our experimental perfor- 
mance studies include our development of BATON* (the best current tree based 
system) , and a detailed performance comparison which showcases the improved 
performance, scalability, and robustness of ART. 

In Section 2 we present more thoroughly key previous work. Section 3 de- 
scribes the ART structure and analyzes its basic functionalities. Section 4 presents 
a thorough experimental evaluation; Section 5 presents some interesting heuris- 
tics and thresholds, whereas Section 6 concludes the paper. 

2 Previous Work 

Existing structured P2P systems can be classified into three categories: dis- 
tributed hash table (DHT) based systems, skip list based systems, and tree based 
systems. There are several P2P DHT architectures like Chord [15], CAN [22], 
Pastry [23], Tapestry [31], Kademlia [20] and and Kelips [9]. Unfortunately, these 
systems cannot easily support range queries since DHTs destroy data ordering. 
This means that they cannot support common queries such as "find all research 
papers published from 2004 to 2008" . To support range queries, inefficient DHT 
variants have been proposed (for details see [8], [25], [1], [29]). 
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Skip list based systems such as Skip Graph [2, 10] and Skip Net [12] are 
based on the skip-list structure. To provide decentralization they use randomized 
techniques to create and maintain the structure. Moreover, they can support 
both exact match queries and range queries by partitioning data into ranges of 
values. However, they cannot guarantee data locality (which hurts efficient range 
query processing) and load balancing in the system. 

Tree based systems also carry their own disadvantages. P-Grid [5] utilizes a 
binary prefix tree. It can neither guarantee the bound of search steps since it 
cannot control the tree height. An arbitrary multi-way tree was proposed in [19], 
where each node maintains links to its parent, children, sibling and neighbors. 
It also suffers from the same problem. P-Tree [5] utilizes a B + -tree on top of the 
CHORD overlay network, and peers are organized as a CHORD ring, each peer 
maintaining a data leaf and a left most path from the root to that B + -tree node. 
This results in significant overhead in building and maintaining the consistency 
of the B + -tree. In particular, a tree has been built for each joining node, and 
periodically, peers have to exchange their stored B + -tree for checking consistency. 
BATON [13] utilizes a binary balanced tree and as a consequence, it can control 
the tree height and avoid the problem of P-Grid. Nevertheless, similarly to other 
P2P systems, BATON'S search cost is bounded by 0(log 2 N). BATON* [14] is 
an overlay multi-way tree based on B-trees, with better searching performance. 
The penalty paid is a marginally larger update cost. 

Systems like MAAN [4], Mercury [3] and DIM [18] support multi-attribute 
queries in a multi-dimensional space. BATON* can also effectively support queries 
over multiple attributes. In addition to supporting the use of multiple attributes 
in a single index, BATON* further introduces the notion of attribute classifica- 
tion, based on the importance of the attribute for querying, and the notion of 
attribute groups. In particular, BATON* relies on the construction of multiple 
independent indexes for groups of one or more attributes. For further details 
about the suggested techniques for partitioning attributes into such groups, see 
[14]. 
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CHORD 


O(logTV) 


(log N) 


O(logiV) w.h.p. 


H-F-Chord(a) 


o(iogjvyiogiogjv) 


O (log N) 


O(logTV) 


LPRS-Chord 


O(logiV) 


O(logiV) 


O(logJV) 


Skip Graphs 


O(logN) 


O(l) 


0(log N) amortized 


BATON 


O(logiV) 


(log N) 


O(logiV) w.h.p. 


BATON* 


0(log m N) 


0(m log m N) 


0(m log m N) 


ART-tree 


0(logt logN) 


0(N^ 4 /log c N) 


O(loglogiV) expected w.h.p. 



Table 1. Performance comparison between ART, Chord, BATON and Skip Graphs. 



For comparison purposes, in Table 1 we present a qualitative evaluation with 
respect to elementary operations between ART, Skip-Graphs, Chord and its 
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newest variations (F-Chord(a) [26], LPRS-Chord [30]), BATON [13] and its 
newest variation BATON* [14]. It is noted that c is a big positive constant. 

3 Our Solution 

First, we build the LRT (Level Range Tree) structure, one of the basic compo- 
nents of the final ART structure. LRT will be called upon to organize collections 
of peers at each level of ART. 



3.1 Building LRT structure 

LRT is built by grouping peers having the same ancestor and organizing them 
in a tree structure recursively. The innermost level of nesting (recursion) will be 
characterized by having a tree in which no more than b peers share the same 
direct ancestor, where b is a double-exponentially power of two (e.g. 2,4,16,...). 
Thus, multiple independent trees are imposed on the collection of peers. Figure 1 
illustrates a simple example, where 6 = 2. 




Fig. 1. The LRT structure for b=2 



The degree of the peers at level i > is d(i) = t(i), where t(i) indicates the 
number of peers at level i. It holds that d(0)=b and t(0)=l. Let n be iu-bit keys. 
Each peer with label i (where 1 < i < N) stores ordered keys that belong in the 
range [(£ — 1) Inn, i Inn— 1], where N = n/lnn is the number of peers. Note here 
that the Inn (and not logn) factor is due to a specific combinatorial game ([16]) 
we invoke in the next subsection. 

We also equip each peer with a table named Left Spine Index (LSI), which 
stores pointers to the peers of the left-most spine (see pointers starting from 
peer 5). 
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Furthermore, each peer of the left-most spine is equipped with a table named 
Collection Index (CI) , which stores pointers to the collections of peers presented 
at the same level (see pointers directed to collections of last level) . Peers having 
the same father belong to the same collection. For example, in Figure 1, peers 
8, 9, 10, and 11 constitute a certain collection. 



Lookup Algorithm Assume we are located at peer s (we mean the peer labeled 
by integer number s) and seek a key k. First, we find the range where k belongs 
in. Let say k £ [(j — 1) Inn, j Inn — 1]. The latter means that we have to search 
for peer j. The first step of our algorithm is to find the LRT level where the 
desired peer j is located. For this purpose, we exploit a nice arithmetic property 
of LRT. This property says that for each peer x located at the left-most spine 
of level i, the following formula holds: 

label(x) = label(father(x)) + b 2 *~ 2 (1) 

For example, peer 4 is located at level 2, thus 4 = father(4) + 2 or peer 8 is 
located at level 3, thus 8 = father(8)+4 or peer 24 (not depicted in the Figure 1) 
is located at level 4, thus 24 = father(2A) + 16. The last equation is true since 
father{24) = 8. 

Thus, for each level i (in the next subsection we will prove that < i < 
log log AT), we compute the label x of its left most peer by applying Equation (1). 
Then, we compare the label j with the computed label x. If j > x, we continue 
by applying Equation (1), otherwise we stop the loop process with current value 
i. The latter means that peer j is located at the z-th level. So, first we follow the 
i-th pointer of the LSI table located at peer s so as to reach the leftmost peer 
x of level i. Then, we compute the collection in which the peer j belongs. Since 
the number of collections at level i equals the number of peers located at level 
(i — 1), we divide the distance between j and x by the factor t(i — 1). Let m (in 

be the result of this division. The latter means that we 



particular m 



t(i-l) 

have to follow the (m + l)-th pointer of the CI table so as to reach the desired 
collection. Since the collection indicated by the CI[m+l] pointer is organized in 
the same way at the next nesting level, we continue this process recursively. 



Analysis The degree of the peers at level i > is d(i) = t(i), where t(i) 
indicates the number of peers at level i. It is defined that d(0)=b and t(0)=l. 
It is apparent that t(i) = t(i — l)d(i — 1), and, thus, by putting together the 
various components, we can solve the recurrence and obtain d(i) = t(i) = b 2 
for i > 1. This double exponentially increasing fanout guarantees the following 
lemma: 

Lemma 1: The height (or the number of levels) of LRT is 0(loglog 6 N) in the 
worst case. 

The size of the LSI table equals the number of levels of LRT. Moreover, the 
maximum size of the CI table appears at last level. It is apparent from the 
building of the LRT structure that at last level h, t(h) = O(N). It holds that 
t(h) = b 2h ~\ thus b 2 ^ 1 = O(N) orh-l = 0(loglog b N) or h = 0(loglog b N) + l. 
Since the number of collections at level h equals the number of peers located 
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at level (ft 1) we take t(h 1) = b^ = ^ w -"" + "- J or ^—"W = 

^(,..,0.^,2-! = ^o,io„^l1 lcmma 2 followg: 

Lemma 2: The maximum size of the CJ and LS7 tables is 0(V~N) and 0(log log AT) 
in worst-case respectively. 

We need now to determine what will be the maximum number of nesting 
trees that can occur for TV peers. Observe that the maximum number of peers 
with the same direct ancestor is d(h— 1). Would it be possible for a second level 
tree to have the same (or bigger) depth than the outermost one? 

This would imply that X^=o < ^C 1 — ■'■)■ 

As otherwise we would be able to fit all the d(h—l) peers within the first ft— 1 
levels. But we need to remember that d(i) — t(i), thus d(h — 1) + J2jZo < 
d(h-l). 

This would imply that the number of peers in the first ft — 2 levels is negative, 
clearly impossible. Thus, the second level tree will have depth strictly lower than 
the depth of the outermost tree. 

The innermost (let say j th ) level of nesting (recursion) is characterized by 
having a tree in which no more than b nodes share the same direct ancestor, where 
b is a double-exponentially power of two (e.g. 2,4,16,...). In this case b = N 1 ^ 3 
and the lemma 3 follows: 

Lemma 3: The maximum number of possible nestings in LRT structure is 
0(log b logA r ) in the worst case. 

At each peer we pay an extra processing cost by repeating the equation (1) 
O(loglogAf) times at most in order to locate the desired LSI pointer. Then, 
we need O(l) hops for locating the left-most peer x of the desirable level. We 
must note here that the processing overhead compared to communication over- 
head is negligible, thus we can ignore the 0(loglog N) processing factor at each 
peer. Finally we need 0(1) hops for locating the desirable collection of peers via 
the CT[m+l] pointer. Since, the collection indicated by the CI[m+l] pointer is 
organized in the same way at a next nesting level, we continue the above pro- 
cess recursively. According to lemma 2 the maximum number of nesting levels 
is <3(log b log N), and the theorem follows: 

Theorem 1: Exact- match queries in the LRT structure require 0(log fc log N) 
hops or lookup messages in the worst case. 

3.2 Building ART Structure 

We define as cluster.peer a bucket of 6>(polylog N') ordered peers, where N' is 
the number of cluster_peers. 

At initialization step we choose as cluster_peers the 1st peer, the (lnn+ l)-th 
peer, the (2 Inn + l)-th peer and so on. This means that each cluster.peer with 
label i 1 (where 1 < i' < N') stores ordered peers with sorted keys belonging in 
the range — 1) In 2 n, ... , i' In 2 n — 1], where N' = nj In 2 n is the number of 
cluster.peers. 

ART stores cluster_peers only, each of which is structured as an independent 
decentralized architecture. The backbone-structure of ART is exactly the same 
with LRT (see Figure 2). Moreover, instead of the Left- most Spine Index (LSI), 
which reduces the robustness of the whole system, we introduce the Random 
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Fig. 2. The ART structure for b=2 

Spine Index (RSI) routing table, which stores pointers to randomly chosen (and 
not to left-most) cluster _peers (see in Figure 2 the pointers starting from peer 3). 
In addition, instead of using fat CI tables, we access the appropriate collection 
of cluster_peers by using a 2-level LRT structure. The 2-level LRT is an LRT 
structure over log 2c Z buckets each of which organizes log fe z collections in a 
LRT manner, where Z is the number of collections at current level and c is a big 
positive constant (see Figure 3) 

Load Balancing We model the join/leave of peers inside a cluster_peer as the 
combinatorial game of bins and balls presented in [16] and the lemma 4 follows: 
Lemma 4: Given a /x(-) random sequence of join/leave peer operations, the 
load of each cluster _peer never becomes zero and never exceeds <9(polylog N 1 ) 
size in expected w.h.p. case. 

Routing Overhead ART stores cluster_peers, each of which is structured as 
an independent decentralized architecture (be it BATON*, Chord, Skip-Graph, 
e.t.c.) (see Figure 2). Here, we will try to avoid the existence of CI routing tables, 
since these tables may become very large (0(\^N)) in the worst case as well as 
the occurrence of local hot spots in the left-most spine results in a less robust 
decentralized infrastructure. Thus, instead of the Left-most Spine Index (LSI), 
we introduce the Random Spine Index (RSI) routing table. The latter table 
stores pointers to the cluster_peers of a random spine (for example, in Figure 2 
the randomly chosen cluster_peers 1,2,6 and 10 are pointed to by the RSI table of 
cluster _peer 3). Furthermore, instead of CI tables, we can access the appropriate 
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Ci denotes the i-th collection 



Fig. 3. The 2-level LRT structure 

collection of cluster.peers by using the 2-level LRT structure discussed above (see 
Figure 3). Since the larger number of collections is Z = 0(N 1 ^ 2 ) (it appears in 
the last level), the overhead of routing information is dominated by the second 

level structures in each of which we have an 0(^J j^f^) = O^N 1 ^ 4 / log c N) 

routing overhead. Thus, Theorem 2 follows: 

Theorem 2: The overhead of routing information in ART is 0(N 1 ^ A / log c N) 
in the worst case. 

Remark 1: If we use a k- level LRT structure, the routing information overhead 
becomes 0(N X ^ 2 /log c N) in the worst case. 

Lookup Algorithms Let us explain the lookup operations in ART. For ex- 
ample, in Figure 4 suppose we are located at cluster_peer 3 and we are looking 
for two keys, which are located at cluster _peers 19 and 119 respectively. The 
first step of our algorithm is to find the levels of the ART where the desired 
cluster .peers (e.g. 19 and 119) are located. In our example, the fourth and fifth 
levels are the desired levels. By following the RSI [4] and RSI[5] pointers we reach 
the cluster_peers 10 and 87 respectively. Now, we are starting from peers 10 and 
87 to lookup the peers 19 and 119 respectively in the 2-level LRT structures of 
the collections in respective levels. 

Generally speaking, since the maximum number of nesting levels is 0(log b log N) 
and at each nesting level i we have to apply the standard LRT structure in TV 1 / 2 * 
collections, the whole searching process requires Ti(iV) hops or lookup messages 
to locate the target cluster_peer, where: 

log b log N log b log N 

T 1 (N)= £ log 6 bg(JV 1 / 2< )=log 6 ( J] M^ 172 *)) (2) 

t=0 i=0 

where 

logs log 

J[ \og{N 1/r ) < (log A0 log6 log N 

i=0 

from which we get: 

Tr(iV) < log b ((logN) los > losN ) = 0(\og 2 b \ogN) 
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Then, we have to locate the target peer by searching the respective decentral- 
ized structure, requiring T2(N) hops. Since each of the known decentralized 
architectures requires a logarithmic number of hops, the total process requires 
T(N) = Ti(N) + T 2 (N) = 0(\og 2 b log AT) hops or lookup messages and the theo- 
rem follows. 

Theorem 3: Exact-match queries in the ART structure require 0{\og\ log N) 
hops or lookup messages. 

Having located the target peer for key ke and exploiting the order of keys 
on each node, range queries of the form [ke,k r ] require an 0(logl\ogN + \A\) 
complexity, where \A\ is the number of node-peers between the peers responsible 
for ki, k r respectively. The theorem follows. 

Theorem 4: Range queries of the form [ke, k r ] in the ART structure require an 
0(\ogl \ogN + \A\) complexity, where \A\ is the answer size. 



Fig. 4. An example of Lookup Steps via RSI[ ] tables and 2-level LRT structures 



Query Processing, Data Insertion and Data Deletion, Peer Join and 
Peer Departure In the following we briefly present the basic routines for query 
processing, data insertion and data deletion, peer join and peer departure. 

The RangeJ3earch(s,ki,k r ) routine (Algorithm 1) gets as input the peer s 
in which the query is initiated and the respective range of keys [ke,k r ] and 
returns as output the id of the cluster_peer S, which contains peer s as well 
as the cluster_peer W in which the key ke belongs. Then, it calls the basic 
ART_Lookup(T, S,idS,W,idW ) routine, in order to locate the target peer re- 
sponsible for key ke, and then, exploiting the order of keys on each peer performs 




2-level LRT 
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Algorithm 1 Range_Search(s,/c£,fc r ,A) 

1: Input: s,ke,k r (we are at peer s and we are looking for keys in range [kt, k r ]) 

2: Output: idW (the identifier of cluster-peer W, which stores kt key), A (the answer) 

3: BEGIN 

4: We compute idS*:the identifier of Cluster_peer S, which contains peer s; 
5: We compute idW:\et j be the identifier of target Cluster_peer W, which stores ke 
key; 

6: Let T the basic ART structure of cluster-peers; 

7: W=ART_Lookup(T, S, idS, W, idW); {call of the basic routine} 

8: A=Linear_Scan of all Cluster_peers located in and right to W until we find a 

k&y ^> k r . 
9: END 



a right linear jscan till it finds a key > k r . 

The ART_Lookup(T, S,idS,W,idW) routine (Algorithm 2) gets as input the 
cluster.peer S (with identifier idS) in which the query is initiated and returns 
as output the id (idW) of the cluster_peer W in which the key ke belongs. T 
denotes the ART-trce structure. Moreover, Algorithm 2 requires (3(log^ log iV) 
hops, according to first part (Ti(iV)) of Theorem 3. Obviously, the same com- 
plexity holds for insert /delete key operations (see Algorithms 3 and 4), since we 
have to locate the target peer into which the key must be inserted or deleted. 

For join (depart) peer operations (for details see Algorithm 5), we need 
0(log^ log N) + Tj i n (N) (O(logblogTV) +T depart (N)) lookup messages, where 
Tj i n (N) (Td epar t(N)) is the number of hops required from the respective decen- 
tralized structure for peer-join (peer-departure). 

In the peer join algorithm we assumed that the new peer is accompanied by 
a key, and this key designates the exact position in which the new peer must be 
inserted. If an empty peer u makes a join request at a particular peer v (which 
we call entrance peer) then there is no need to get to a different cluster peer than 
the one in which u belongs. Similarly, the algorithm for the departure of a peer 
u assumes that the request for departure of peer u can be made from any peer 
in the ART-structure. This may not be desirable, and in many applications it 
is assumed that the choice for departure of peer u can be made only from this 
peer. Of course, in this way the algorithm for peer departure is simplified since 
there is no need to traverse the ART structure but only the cluster peer in which 
u belongs. In order to bound the size of each cluster_peer we assume that the 
probability of picking an entrance peer is equal among all existing peers, and 
that the probability of a peer departing is equal among all existing peers in the 
ART. Since the size of the cluster _peer is bounded by polylogN expected w.h.p., 
the following theorem is established: 

Theorem 5: The peer join/departure can be carried out in 0{loglogN) hops 
or lookup messages. 

Node Failure, Fault Tolerance, Network Restructuring and Load Bal- 
ancing Since we have modeled the join/leave of peers inside a cluster_peer as 
the combinatorial game of bins and balls presented in [16], each cluster _peer of an 
ART structure (according to lemma 4) never exceeds a polylogarithmic number 
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Algorithm 2 ART_Lookup(T, S, idS, W, idW) 
1: Input: We are at cluster-peer S with identifier idS 
2: Output: We are looking for the cluster-peer W with identifier idW 
3: BEGIN 

4: If (S is responsible for ke) 
5: Return S; 
6: Else 

7: If W=\ then i=0; 

8: Else if W € {2, 3, . . . , b + 1] then i=l; 

9: Else 

10: x=b+2; 

11: For (t = 2;i < Ci/og log 6 iV; + + i) 

12: x = father (x) + b 2% ~ 2 ■ 

13: If j < a; then break( ); 

14: Follow the RSI[i] pointer of cluster _peer S; 

15: Let X the correspondent cluster_peer; 

16: Search for W the 2-level LRT structure starting from X; 

17: Let Y the first cluster-peer of the correspondent collection; 

18: Let T' the ART structure of the collection above at next level of nesting with root 

the cluster-peer Y; 
19: S = Y; 

20: ART .Lookup (T', S,idS, W,idW); {recursive call of the basic routine} 
21: Return W; 
22: END 



Algorithm 3 ART_inscrt(T, s, k) 

1: Input: We are at peer s and we want to insert the key k 
2: Output: The peer w in which k must be inserted 
3: BEGIN 

4: We compute idS:the identifier of Cluster_peer S, which contains peer s; 
5: We compute idW:\et j be the identifier of target Cluster_peer W, which stores the 
k key; 

6: ARTJLookup(T, S, idS, W,idW); 
7: Let W the target cluster_peer; 
8: Search W for peer w containing k; 
9: If k does not exist into w, then insert k into it; 
10: END 



Algorithm 4 ART.dclctc(T, s, k) 
1: Input: We are at peer s and we want to delete the key k 
2: Output: The peer w in which k must be deleted 
3: BEGIN 

4: We compute idS:the identifier of Cluster_peer S, which contains peer s; 
5: We compute idWilet j be the identifier of target Cluster_peer W, which stores the 
k key; 

6: ARTJLookup(T, S, idS, W,idW); 
7: Let W the target cluster_peer; 
8: Search W for peer w containing k; 
9: If k exists into w, then delete it; 
10: END 
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Algorithm 5 ART_join/leave_peer(T, s, w) 

1: Input: We are at peer s and we want to insert/delete the new peer w 

2: Output: The cluster_peer W in which the peer w must be inserted/deleted 

3: BEGIN 

4: We compute idS:the identifier of Cluster_peer S, which contains peer s; 
5: We compute idW:\et j be the identifier of target Cluster_peer W, which contains 
peer w; 

6: ART_Lookup(T, S, idS, W, idW); {call of the basic routine} 
7: Let W the target cluster_peer; 
8: Insert/delete w into/from W; 
9: END 



of peers and never becomes empty in expected case with high probability. The 
latter means that the skeleton ART structure of cluster _peers remains unchanged 
in the expected case with high probability as well as in each cluster_peer the al- 
gorithms for peer failure, network restructuring and load balancing are according 
to the polylogarithmic-sized decentralized architecture we use. 

Multi-attribute Queries As in [14], we divide the whole range of attributes 
into several sections: each section is used to index an attribute (if it appears 
frequently in queries) or a group of attributes (if these attributes rarely appear 
in queries). Since ART can only support queries over one-dimensional data, if we 
index a group of attributes, we have to convert their values into one-dimensional 
values (by choosing Hilbert space filling curve or other similar methods). For 
example, if we have a system with 12 attributes: a\, a,2, • • • , an hi which only 
4 attributes from a\ to are frequently queried (i.e. 90% of all queries), we 
can build 4 separate indexes for them. The remaining attributes can be divided 
equally into two groups to index, four attributes in each group. This way, the 
number of replications can be significantly reduced from 12 down to 6. 

4 Evaluation 

For evaluation purposes we used the Distributed Java D-P2P-Sim simulator 
presented in [27]. The D-P2P-Sim simulator is extremely efficient delivering 
> 100,000 cluster peers in a single computer system, using 32-bit JVM 1.6 
and 1.5 GB RAM and full D-P2P-Sim GUI support. When 64-bit JVM 1.6 and 

5 RAM is utilized the D-P2P-Sim simulator delivers > 500, 000 cluster peers 
and full D-P2P-Sim GUI support in a single computer system. When D-P2P- 
Sim simulator acts in a distributed environment with multiple computer systems 
with network connection delivers multiple times the former population of cluster 
peers with only 10% overhead. 

Our experimental performance studies include a detailed performance compar- 
ison with BATON*, one of the state-of-the-art decentralized architectures. In 
particular, we implemented each cluster_peer as a BATON* [14], the best known 
decentralized tree-architecture. We tested the network with different numbers of 
peers ranging up to 500,000. A number of data equal to the network size multi- 
plied by 2000, which are numbers from the universe [1.. 1,000, 000, 000] are inserted 
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to the network in batches. The synthetic data (numbers) from this universe were 
produced by the following distributions: beta, uniform and power-law. For each 
test, 1,000 exact match queries and 1,000 range queries are executed, and the 
average costs of operations are taken. Searched ranges are created randomly by 
getting the whole range of values divided by the total number of peers multiplies 
a, where a £ [1..10]. Note that in all experiments the default value of parameter 
b is 4. The source code of the whole evaluation process is publicly available 6 . 

4.1 Single- and Multi-attribute Query Performance 
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Fig. 5. Cost of exact match query (left) and cost of range query (right). 



As proved previously, the whole query performance of ART is 0(log 2 log A''') 
where the N' cluster_peers structure their internal peers according to the BA- 
TON* architecture. For normal, beta and uniform distributions each cluster_peer 
contains 0.75 log 2 A peers on average and for power- law distributions each clus- 
ter_peer contains 2.5 log 2 A peers on average. Thus, in the former case the av- 
erage number of cluster .peers is N' = Q 75 N , whereas in the latter case 

the number of cluster_peers becomes A' = 2 5 \o^' 2 n on avera S e - I n ai l cases, 
ART outperforms BATON* by a wide margin. As depicted in Figure 5 (up), 
our method is almost 2 times faster and as a consequence we have a 50% im- 
provement. The results are analogous with respect to the cost range queries as 
depicted in Figure 5 (down). 

Figure 6 (up) depicts the cost of updating routing tables. Since each cluster_peer 
structures 0( A/poly log A) (and not O(N)) peers according to BATON* archi- 
tecture, the results are as expected. We remark that BATON* requires m log m A 
hops, whereas m log m poly log A hops are required by ART. In particular and as 
depicted in Figure 6 (up), our method updates the routing tables 3 or 4 times 
faster. Figure 6 (down) depicts the insertion cost in multi-attribute case, where 
we have 6 separate indexes. BATON* requires 6 log A hops and ART requires 
6 log 2 log(A/polylog A) + 6 log (poly log A) hops. We observe that the insertion 
cost of ART is the lowest for any distribution. Again, our method is almost 2 
times faster. Finally, the results are analogous for multi-attribute exact-match 
and range queries respectively (see Figures 7 (up) and 7 (down)). 



6 http: //code .google . com/p/d-p2p-sim/ 
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Fig. 6. Cost of updating routing tables (left) and cost of insertion (right). 
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Fig. 7. Cost of multi-attribute exact-match (left) and range queries (right). 



4.2 Load Balancing 

ART not only reduces the search cost but also achieves better load balancing. To 
verify this claim, we test the network with a variety of distributions and evaluate 
the cost of load balancing. For simplicity, in our system, we assume that the query 
distribution follows the data distribution. As a result, the workload of a peer is 
determined only by the amount of data stored at that peer. In BATON*, when 
a peer joins the network, it is assigned a default upper and lower load limit by 
its parent. If the number of stored data at the peer exceeds the upper bound, 
it is considered as an overloaded peer and vice versa. If a peer is overloaded 
and cannot find a lightly loaded leaf peer, it is likely that all other peers also 
have the same work load; thus, it automatically increases the boundaries of 
storage capability. In ART the overlay of cluster_peer remains unaffected in the 
expected case with high probability when peers join or leave the network. Thus, 
the load-balancing performance is restricted inside a cluster_peer (which is a new 
BATON* structure) and as a result ART needs no more than 4 lookup-messages 
(instead of 1000 messages needed from BATON* in case of 500.000 nodes). For 
details see Figure 8 (up). 

4.3 Fault Tolerance 

To evaluate the system's fault tolerance in case of massive failure we initialized 
the system with 10,000 peers. In the sequel, we let peers randomly fail step by 
step without recovering. At each step, we check to see if the network is parti- 
tioned or not. With massive peer failures, we face a massive destruction of links 
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Fig. 8. Cost of load-balancing (left) and search cost in case of massive failure (right). 



connected to failed peers. Since the search process has to bypass these peers, 
the search query has to be forwarded forth and back several times to find a way 
to the destination and as a result the search cost is expected that will increase 
substantially. Since the backbone of ART structure remains unaffected w.h.p., 
meaning that there is always a peer for playing the role of cluster representative, 
the search cost is restricted inside a cluster_peer (which is a BATON* struc- 
ture) and as a result ART needs no more than 32 lookup-messages (instead of 
180 messages needed from BATON* in case of 6.000 nodes). Figure 8 (down) 
illustrates this effect. 

5 Trade-offs and Heuristics 

If each collection of cluster_peers is organized individually as a BATON* struc- 
ture (not the whole level of collections) , then we can climb up the ART structure 
until we reach the nearest common ancestor of the cluster_peer we are located 
in, and the cluster peer we are searching. Then a downwards traversal is ini- 
tiated to reach this cluster_peer. Since, each collection of i th -\evel is organized 
according to BATON*, we can decide in Oilog^^^n 1 / 21 ) hops the child we must 
follow for further searching. As a result, the total time becomes 0(log m n) and 
no improvement has been achieved. 

In our solution, if we parameterize the size of the buckets (depicted in Figure 
3) from 0(log 2c A) to 0(log 2/(JV) N), where f(N) is a function of the network 
size, then we can get an interesting trade-off between the routing data over- 
head and the number of hops for an operation. In particular, if Z is the num- 
ber of collections at the current level, then each bucket contains Q( t 2 /fjvj N ) 

collections. Thus, the first LRT layer organizes 0(\og 2f{N) N) bucket represen- 
tatives and each second LRT layer organizes 0( log2 /fw) N ) collections. In this 
case, the routing overhead is dominated by the second layer LRTs which be- 
comes O( log j (w) N ). To achieve an optimal routing data overhead we would like 

the following: O( lo j ( «) N ) = 0(1) & f(N) = O(logiV). In this case the first 

LRT layer contains 0(\og 2 ^ N ^ N) or O(log 21og N N) bucket representative nodes. 
Therefore, a lookup operation in first layer requires 0(loglog(log 21ogAr N)) or 
w(loglogA) hops. Each of the second layer LRTs contains 0( 2f ^ N) N ) collec- 
tion representative nodes, where Z is the number of collections at current level. 



Autonomous Range Tree 17 



Therefore, the number of hops required by a lookup operation in second layer 
is O(loglogiV). So, the total time becomes a; (log log AT) and the sub-logarithmic 
complexity is not guaranteed. As a result, if we want an optimal routing over- 
head we cannot guarantee sub-logarithmic complexity. If we relax the routing 
overhead to be of polynomial size then we can achieve this. 

In our solution the routing data overhead ((^(N 1 / 4 / log c N)) is a polyno- 
mial function. However, in reality even for an extremely large number of peers 
N=l. 000. 000. 000, the routing data overhead is 6 for c = 1, which is less than the 
fanout of BATON* (m = 10) that we used to run our experiments. The latter 
demonstrates the significance of our result. 

6 Conclusions 

We presented a new efficient decentralized infrastructure for range query pro- 
cessing with probabilistic guarantees, the ART structure. Theoretical analysis 
showed that the communication cost of query, update and join/leave node op- 
erations scale sub- logarithmically expected w.h.p.. Experimental performance 
comparison with BATON*, the state-of-the-art decentralized structure, showed 
the improved performance, scalability and efficiency of our new method. Finally, 
we believe that ART will enable general purpose decentralized trees to support 
a wider class of queries, and then broaden the horizon of their applicability. 
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